Get started with OpenAI's GPT-OSS, a duo of open-weight models built for transparency and control.
These versatile open-weight reasoning models are crafted for developers, researchers, and enterprises who require transparency, adaptability, and the freedom to customize - all while retaining powerful chain-of-thought reasoning capabilities. Both GPT-OSS models are trained to think step-by-step before producing an answer, enabling them to excel at complex reasoning tasks including coding challenges, mathematical problems, strategic planning, puzzles, and agent-based workflows. They also support adjustable reasoning effort levels, giving you fine-grained control to balance deeper reasoning performance with the computational cost that best fits your use case.

How to Use the GPT-OSS Inference API

These models are accessible exclusively to users on Build Tier 1 or higher. Because reasoning models generate longer, more detailed outputs through chain-of-thought processing, we recommend streaming tokens to ensure a smoother and more responsive user experience.
import requests

url = "https://platform.qubrid.com/api/v1/qubridai/chat/completions"
headers = {
    "Authorization": "Bearer Qubrid_API_KEY",  # replace with your Qubrid API key
    "Content-Type": "application/json"
}

data = {
    "model": "openai/gpt-oss-120b",
    "messages": [
        {
            "role": "user",
            "content": "Explain quantum computing to a 5 year old."
        }
    ],
    "temperature": 0.7,
    "max_tokens": 65536,
    "stream": False,  # set to True to stream tokens (recommended; see the sketch below)
    "top_p": 0.8
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()
result = response.json()
A non-streaming request returns a complete response object like the one below:
{
    "id": "chatcmpl-4968cd8b586c496bbef3bba04edbbe75",
    "object": "chat.completion",
    "created": 1764850905,
    "model": "openai/gpt-oss-120b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "## Quantum Computing – A Primer\n\n### 1. What Is a Quantum Computer?\nA **quantum computer** is a device that performs computation using the laws of quantum mechanics instead of the classical physics that underlies today’s laptops and smartphones.  \nWhile a classical computer manipulates bits that are either **0** or **1**, a quantum computer manipulates **quantum bits** (or **qubits**) that can exist in a superposition of both states simultaneously. This property, together with **entanglement** and **interference**, lets a quantum processor explore many computational paths in parallel, potentially solving certain problems far faster than any classical machine.\n\n---\n\n### 2. Core Quantum Concepts\n\n| Classical Idea | Quantum",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": "We need to respond with an explanation of quantum computing. The system messages: we are ChatGPT, we have to follow policies. There's no disallowed content. Provide a clear, thorough explanation, possibly with analogies, basics of qubits, superposition, entanglement, quantum gates, algorithms, challenges, current state, future. Should be accessible but also technical. Use headings. Possibly ask follow-up. Provide references. Ensure no copyrighted text. Provide original content. Let's produce a comprehensive answer."
            },
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 75,
        "total_tokens": 331,
        "completion_tokens": 256,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "prompt_token_ids": null,
    "kv_transfer_params": null
}
To access just the chain-of-thought reasoning, read the reasoning_content property of the message:
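Using the result parsed above:

reasoning = result["choices"][0]["message"]["reasoning_content"]
print(reasoning)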
We need to respond with an explanation of quantum computing. The system messages: we are ChatGPT, we have to follow policies. There's no disallowed content. Provide a clear, thorough explanation, possibly with analogies, basics of qubits, superposition, entanglement, quantum gates, algorithms, challenges, current state, future. Should be accessible but also technical. Use headings. Possibly ask follow-up. Provide references. Ensure no copyrighted text. Provide original content. Let's produce a comprehensive answer.
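The example request above disables streaming so the full response object can be shown. Because streaming is recommended for reasoning models, here is a minimal sketch for consuming the stream, assuming the endpoint emits OpenAI-style "data: ..." server-sent events terminated by "data: [DONE]" (verify the chunk format against Qubrid's API reference):

import json
import requests

stream_data = {**data, "stream": True}  # reuse url, headers, and data from above
with requests.post(url, headers=headers, json=stream_data, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip keep-alives and blank separator lines
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)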

Available Models

Two capable open-weight models are available to meet different deployment needs.
GPT-OSS 120B:
  • Model String: openai/gpt-oss-120b
  • Hardware Requirements: Fits on a single 80GB GPU
  • Architecture: Mixture-of-Experts (MoE) with token-choice routing
  • Context Length: 128k tokens with RoPE
  • Best for: Enterprise applications requiring maximum reasoning performance
GPT-OSS 20B:
  • Model String: openai/gpt-oss-20b
  • Hardware Requirements: Runs within 16GB of memory, suitable for consumer hardware
  • Architecture: Optimized MoE for efficiency
  • Context Length: 128k tokens with RoPE
  • Best for: Research, development, and cost-efficient deployments
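Switching between the two models in the inference API is a one-line change to the example payload above:

data["model"] = "openai/gpt-oss-20b"  # smaller model for cost-efficient deployments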

GPT-OSS Best Practices

Reasoning models like GPT-OSS should be used differently from standard instruct models to get optimal results.
Recommended Parameters (an example payload follows this list):
  • Reasoning Effort: Use the adjustable reasoning effort levels to control computational cost vs. accuracy.
  • Temperature: Use 1.0 for maximum creativity and diverse reasoning approaches.
  • Top-p: Use 1.0 to allow the full vocabulary distribution for optimal reasoning exploration.
  • System Prompt: Provide the system prompt as a developer message; use it to give the model its instructions and to declare the available function tools.
  • System message: Avoid modifying the system message itself; it is used to specify the reasoning effort, meta information such as the knowledge cutoff, and built-in tools.
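Putting these recommendations together, an example payload (the reasoning_effort field name is an assumption based on OpenAI-style reasoning APIs; confirm the exact parameter name in Qubrid's API reference):

data = {
    "model": "openai/gpt-oss-120b",
    "messages": [
        {"role": "user", "content": "Outline a migration plan from REST to gRPC."}
    ],
    "temperature": 1.0,            # recommended: diverse reasoning approaches
    "top_p": 1.0,                  # recommended: full vocabulary distribution
    "max_tokens": 8192,
    "reasoning_effort": "medium"   # assumed field name; "low" | "medium" | "high"
}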
Prompting Best Practices: Think of GPT-OSS as a senior problem-solver – provide high-level objectives and let it determine the methodology:
  • Strengths: Excels at open-ended reasoning, multi-step logic, and inferring unstated requirements
  • Avoid over-prompting: Micromanaging steps can limit its advanced reasoning capabilities
  • Provide clear objectives: Balance clarity with flexibility for optimal results

GPT-OSS Use Cases

  • Code Review & Analysis: Comprehensive code analysis across large codebases with detailed improvement suggestions
  • Strategic Planning: Multi-stage planning with reasoning about optimal approaches and resource allocation
  • Complex Document Analysis: Processing legal contracts, technical specifications, and regulatory documents
  • AI Model Evaluation & Benchmarking: Sophisticated, contextually aware evaluation of other LLMs' responses, particularly useful in critical validation scenarios
  • Scientific Research: Multi-step reasoning for hypothesis generation and experimental design
  • Academic Analysis: Deep analysis of research papers and literature reviews
  • Information Extraction: Efficiently extracts relevant data from large volumes of unstructured information, ideal for RAG systems
  • Agent Workflows: Building sophisticated AI agents with complex reasoning capabilities
  • RAG Systems: Enhanced information extraction and synthesis from large knowledge bases
  • Problem Solving: Handling ambiguous requirements and inferring unstated assumptions
  • Ambiguity Resolution: Interprets unclear instructions effectively and seeks clarification when needed

Managing Context and Costs

Reasoning Effort Control

GPT-OSS features adjustable reasoning effort levels to optimize for your specific use case:
  • Low effort: Faster responses for simpler tasks with reduced reasoning depth
  • Medium effort: Balanced performance for most use cases (recommended default)
  • High effort: Maximum reasoning for complex problems requiring deep analysis. With this setting, also set max_tokens to roughly 30,000 so the extended chain of thought is not truncated (see the sketch after this list).
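A minimal sketch of this trade-off, reusing the assumed reasoning_effort field from above (only the ~30,000-token high-effort budget comes from the guidance above; the other budgets are illustrative starting points):

def effort_params(effort: str) -> dict:
    """Map a reasoning-effort level to illustrative request parameters."""
    budgets = {"low": 2048, "medium": 8192, "high": 30000}  # high ~30k per guidance above
    return {"reasoning_effort": effort, "max_tokens": budgets[effort]}

data.update(effort_params("high"))  # merge into the request payload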

Token Management

When working with reasoning models, it’s crucial to maintain adequate space in the context window:
  • Use max_tokens parameter to control response length and costs
  • Monitor reasoning token usage vs. output tokens - reasoning tokens can vary from hundreds to tens of thousands based on complexity (see the usage snippet after this list)
  • Consider reasoning effort level based on task complexity and budget constraints
  • Simpler problems may only require a few hundred reasoning tokens, while complex challenges could generate extensive reasoning
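For example, read the usage object from the parsed result above (the completion count typically includes reasoning tokens; whether they are reported separately depends on the endpoint):

usage = result["usage"]
print(f"prompt tokens:     {usage['prompt_tokens']}")
print(f"completion tokens: {usage['completion_tokens']}")  # typically includes reasoning tokens
print(f"total tokens:      {usage['total_tokens']}")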

Cost/Latency Optimization

  • Implement limits on total token generation using the max_tokens parameter
  • Balance thorough reasoning with resource utilization based on your specific requirements
  • Consider using lower reasoning effort for routine tasks and higher effort for critical decisions

Technical Architecture

Model Architecture

  • MoE Design: Token-choice Mixture-of-Experts with SwiGLU activations for improved performance
  • Expert Selection: Softmax-after-topk approach for calculating MoE weights, ensuring optimal expert utilization (illustrated after this list)
  • Attention Mechanism: RoPE (Rotary Position Embedding) with 128k context length
  • Attention Patterns: Alternating between full context and sliding 128-token window for efficiency
  • Attention Sink: Learned attention sink per-head with additional additive value in the softmax denominator
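To illustrate softmax-after-topk routing, a minimal NumPy sketch with toy dimensions (not the production implementation): the router first selects the k largest expert logits for a token, then normalizes with a softmax over only those selected logits.

import numpy as np

def moe_weights(router_logits: np.ndarray, k: int):
    """Softmax-after-topk: pick top-k experts, then softmax over only their logits."""
    topk_idx = np.argsort(router_logits)[-k:]        # indices of the k largest logits
    topk_logits = router_logits[topk_idx]
    w = np.exp(topk_logits - topk_logits.max())      # numerically stable softmax
    return topk_idx, w / w.sum()

logits = np.array([0.3, 2.1, -0.5, 1.7, 0.0])        # toy router output for one token
experts, weights = moe_weights(logits, k=2)
print(experts, weights)                              # weights sum to 1 over the chosen experts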

Tokenization

  • Standard Compatibility: Uses the same tokenizer as GPT-4o
  • Broad Support: Ensures seamless integration with existing applications and tools
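Assuming the GPT-4o tokenizer compatibility described above, you can estimate token counts locally with the open-source tiktoken package and its o200k_base (GPT-4o) encoding; treat the counts as estimates rather than exact parity with the hosted model:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the GPT-4o encoding
n_tokens = len(enc.encode("Explain quantum computing to a 5 year old."))
print(n_tokens)  # rough prompt-size estimate before calling the API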

Context Handling

  • 128k Context Window: Large context capacity for processing extensive documents
  • Efficient Patterns: Optimized attention patterns for long-context scenarios
  • Memory Optimization: GPT-OSS 120B is designed to fit efficiently within 80GB of GPU memory